test: add real runAgentInSandbox() E2E tests for telegram injection #1267
jyaunches wants to merge 2 commits into NVIDIA:main
Conversation
Shared E2E infrastructure improvements that make Brev-based CI reliable on CPU-only instances:

brev-setup.sh:
- Extract HAS_GPU flag for consistent GPU detection: nvidia-smi must run successfully, not just exist on PATH (Brev GPU images ship the binary even on CPU instances)
- All GPU gates (container toolkit, Docker runtime reset, vLLM) use the single HAS_GPU flag
- Replace cloud-init || true with proper error check + warning
- Add timeout (300s) and quiet window (5s) to apt wait loop to prevent indefinite hangs and races
- Add retry loops (5 attempts, 30s backoff) for Node.js and Docker apt-get installs
- Unsuppress NodeSource installer output for debugging
- Reset Docker default runtime to runc on CPU-only instances where Brev pre-configures nvidia as default

setup.sh:
- Check nvidia-smi exit code instead of just PATH presence for GPU gateway flag

brev-e2e.test.js:
- Use known GCP instance type (n2-standard-4) instead of cpu search
- Add shellQuote() for safe secret interpolation in SSH commands
- Delete leftover instances before creating new ones
- Wait for cloud-init before running bootstrap
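The GPU check and the retry loop described above can be sketched roughly as follows. This is a minimal illustration, not the code from brev-setup.sh: the function names `detect_gpu` and `retry` are hypothetical, and only the `HAS_GPU` flag name and the "5 attempts, 30s backoff" figures come from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; names are not taken from brev-setup.sh.

# Succeed only if the probe binary is on PATH *and* exits 0 when run.
# Presence on PATH alone is not enough on CPU-only instances.
detect_gpu() {
  local probe="${1:-nvidia-smi}"
  command -v "$probe" >/dev/null 2>&1 && "$probe" >/dev/null 2>&1
}

# retry ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Re-run CMD until it succeeds or ATTEMPTS is exhausted.
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "failed after $attempts attempts: $*" >&2
      return 1
    fi
    echo "attempt $n/$attempts failed; retrying in ${delay}s" >&2
    sleep "$delay"
    n=$((n + 1))
  done
}

# Compute the flag once and gate every GPU-specific step on it.
if detect_gpu; then HAS_GPU=1; else HAS_GPU=0; fi
echo "HAS_GPU=$HAS_GPU"
```

With these helpers, an install step could be wrapped as `retry 5 30 apt-get install -y nodejs`, assuming the package name matches the one the real script installs.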
Add Phases 7-11 to test-telegram-injection.sh addressing PR NVIDIA#1092 feedback from @cv: exercise real production code paths instead of ad-hoc SSH commands.

New test phases:
- Phase 7: Real runAgentInSandbox() with $(command) and backtick injection payloads (T9-T11)
- Phase 8: Multi-line message handling with embedded injection (T12-T13)
- Phase 9: Session ID option injection prevention via sanitizeSessionId, including negative Telegram chat IDs (T14-T15)
- Phase 10: Empty and special-character-only session IDs (T16-T17)
- Phase 11: Cleanup of marker files and temp scripts

These tests are expected to FAIL on unfixed code (runAgentInSandbox is not exported, signature doesn't accept options), proving they detect the vulnerability. They will pass once PR NVIDIA#119 is merged.

Refs: NVIDIA#118, PR NVIDIA#1092
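The marker-file technique these phases rely on can be illustrated with a small standalone sketch. It does not call runAgentInSandbox(); it only shows, under the assumption that a marker file is the detection signal, how a `$(command)` payload fires when a message is interpolated into a re-parsed shell string and stays inert when passed as a positional argument. All variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Standalone demo of marker-file injection detection (not the real test script).

tmpdir="$(mktemp -d)"
marker="$tmpdir/injection-marker"

# The message a malicious user might send; \$ keeps it literal here.
payload="hello \$(touch $marker) world"

# Unsafe pattern: the payload becomes part of the script text that sh parses,
# so the command substitution runs and the marker file appears.
sh -c "echo $payload" >/dev/null
if [ -e "$marker" ]; then unsafe_result=executed; else unsafe_result=inert; fi
rm -f "$marker"

# Safe pattern: the payload is passed as "$1" and is never re-parsed as code.
sh -c 'printf "%s\n" "$1"' _ "$payload" >/dev/null
if [ -e "$marker" ]; then safe_result=executed; else safe_result=inert; fi

# Cleanup of marker files, mirroring the final phase.
rm -rf "$tmpdir"
echo "unsafe=$unsafe_result safe=$safe_result"
```

A test phase built this way asserts that the marker does not exist after the production code path has handled the payload.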
📝 Walkthrough
Scripts across provisioning, setup, and testing were updated. GPU detection now validates successful execution rather than command presence. Apt access is serialized with cloud-init synchronization, background service termination, process killing, lock clearing, and idle-loop verification. Node.js and Docker installations use retry logic with backoff. E2E tests now specify instance types directly and await cloud-init completion. Telegram injection tests expand coverage with sandboxing and session ID sanitization validation.
Sequence Diagram(s)
sequenceDiagram
    actor Script as Brev Setup<br/>Script
    participant CloudInit as cloud-init
    participant SystemD as systemd<br/>(timers)
    participant AptDpkg as apt/dpkg<br/>Processes
    participant Locks as Lock<br/>Files
    participant DpkgConfig as dpkg<br/>--configure
    Script->>CloudInit: Wait for cloud-init<br/>status --wait
    CloudInit-->>Script: Ready
    Script->>SystemD: Stop/disable apt-daily*<br/>& unattended-upgrades
    SystemD-->>Script: Stopped
    Script->>AptDpkg: Force-kill apt-get,<br/>apt, dpkg processes
    AptDpkg-->>Script: Terminated
    Script->>Locks: Clear apt/dpkg<br/>lock files
    Locks-->>Script: Cleared
    Script->>DpkgConfig: Run dpkg --configure -a
    DpkgConfig-->>Script: Configuration complete
    loop Until Idle or Timeout
        Script->>AptDpkg: Check for<br/>running processes
        alt Idle Window Detected
            AptDpkg-->>Script: No activity ✓
            Script->>Script: Exit loop
        else Still Active
            AptDpkg-->>Script: Still running
            Script->>Script: Wait & retry
        end
    end
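The "Until Idle or Timeout" loop in the diagram could look something like the sketch below. It is an assumption-laden illustration: `wait_for_idle` and `apt_busy` are hypothetical names, and the one-second polling interval is a choice made for the example. Only the 300-second timeout and 5-second quiet window come from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical idle-wait helper; not copied from brev-setup.sh.

# wait_for_idle TIMEOUT_SECONDS QUIET_SECONDS BUSY_CMD [ARGS...]
# BUSY_CMD must exit 0 while work is still in progress.
# Returns 0 once BUSY_CMD has reported "not busy" for QUIET_SECONDS
# consecutive one-second polls, or 1 if TIMEOUT_SECONDS polls pass first.
wait_for_idle() {
  local timeout="$1" quiet="$2" elapsed=0 idle=0
  shift 2
  while [ "$elapsed" -lt "$timeout" ]; do
    if "$@"; then
      idle=0                      # activity seen: restart the quiet window
    else
      idle=$((idle + 1))
      if [ "$idle" -ge "$quiet" ]; then
        return 0
      fi
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}

# Example busy check: succeeds while any apt-get, apt, or dpkg process exists.
apt_busy() {
  pgrep -x 'apt-get|apt|dpkg' >/dev/null 2>&1
}

# Intended use (commented out so the sketch has no side effects when sourced):
# wait_for_idle 300 5 apt_busy || echo "WARNING: apt still busy after 300s" >&2
```

Requiring a quiet window rather than a single idle observation guards against the race where one apt process exits just before another starts.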
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Warning: Review ran into problems
🔥 Problems: Timed out fetching pipeline failures after 30000ms
🧹 Nitpick comments (1)
test/e2e/test-telegram-injection.sh (1)
569-595: Consider running the backgrounded command in an explicit subshell for clarity.

The background job at lines 569-571 uses `cd "$REPO" && ...`, which implicitly creates a subshell when backgrounded. While this works, an explicit subshell `(cd "$REPO" && ...)` would make the intent clearer and ensure consistent behavior across shell versions.

That said, the current implementation is functionally correct: `kill $BRIDGE_PID` will signal the job leader, and `wait $BRIDGE_PID` will reap it properly.

🔧 Optional: explicit subshell

- cd "$REPO" && OPENSHELL_PATH="$OPENSHELL_PATH" \
+ (cd "$REPO" && OPENSHELL_PATH="$OPENSHELL_PATH" \
    NEMOCLAW_SANDBOX_NAME="$SANDBOX_NAME" \
-   timeout 60 node /tmp/test-real-bridge.js 'Hello world' 'e2e-ps-check' &
+   timeout 60 node /tmp/test-real-bridge.js 'Hello world' 'e2e-ps-check') &
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@test/e2e/test-telegram-injection.sh`:
- Around line 569-595: The backgrounded command that sets
OPENSHELL_PATH/NEMOCLAW_SANDBOX_NAME and runs node /tmp/test-real-bridge.js
should be executed in an explicit subshell to make the intent clear and ensure
consistent behavior across shells; wrap the current `cd "$REPO" && ...`
invocation in a subshell (i.e., `(cd "$REPO" && OPENSHELL_PATH="..."
NEMOCLAW_SANDBOX_NAME="..." timeout 60 node /tmp/test-real-bridge.js 'Hello
world' 'e2e-ps-check') &` so BRIDGE_PID continues to be captured from `$!` and
the environment/working-directory change is isolated. Ensure BRIDGE_PID is still
assigned immediately after backgrounding and that the later kill $BRIDGE_PID and
wait $BRIDGE_PID logic remains unchanged.
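The pattern the review describes (background a job in an explicit subshell, capture `$!`, then kill and reap it) can be tried in isolation with the sketch below. `sleep 5` stands in for the real bridge process, and the added `exec` is a choice made for this example so that the captured PID belongs to the command itself; neither is taken from the test script.

```shell
#!/usr/bin/env bash
# Standalone demo; `sleep 5` is a placeholder for the long-running command.

start_dir="$PWD"
workdir="$(mktemp -d)"

# The cd is confined to the subshell. `exec` replaces the subshell with the
# command, so $! refers to the command's own process and no child is orphaned.
( cd "$workdir" && exec sleep 5 ) &
BRIDGE_PID=$!

# The parent's working directory is untouched by the subshell's cd.
if [ "$PWD" = "$start_dir" ]; then cwd_unchanged=yes; else cwd_unchanged=no; fi

# Stop the job and reap it; wait reports 128 + signal number for a killed job.
kill "$BRIDGE_PID"
status=0
wait "$BRIDGE_PID" 2>/dev/null || status=$?

rmdir "$workdir"
echo "cwd_unchanged=$cwd_unchanged status=$status"
```

A status of 143 (128 + 15 for SIGTERM) confirms that the job was terminated by the `kill` rather than exiting on its own.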
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1a1d5163-1d85-4ee7-8e01-e6c247d15bea
📒 Files selected for processing (4)
- scripts/brev-setup.sh
- scripts/setup.sh
- test/e2e/brev-e2e.test.js
- test/e2e/test-telegram-injection.sh
Closing this PR. Per @aerickson's direction, the Telegram bridge is going away entirely, which makes this injection test work redundant.
Summary
Add Phases 7-11 to `test-telegram-injection.sh`, addressing PR #1092 feedback from @cv: exercise real production code paths instead of ad-hoc SSH commands.

Depends on #1266 (Brev E2E infra hardening); merge that first.
Split from #1219 to keep test logic separate from infrastructure changes.
Changes
test/e2e/test-telegram-injection.sh
New test phases:
- Phase 7: Real `runAgentInSandbox()` with $(command) and backtick injection payloads (T9-T11)
- Phase 8: Multi-line message handling with embedded injection (T12-T13)
- Phase 9: Session ID option injection prevention via `sanitizeSessionId`, including negative Telegram chat IDs (T14-T15)
- Phase 10: Empty and special-character-only session IDs (T16-T17)
- Phase 11: Cleanup of marker files and temp scripts

These tests are expected to FAIL on unfixed code (`runAgentInSandbox` is not exported, and its signature doesn't accept options), proving they detect the vulnerability. They will pass once PR #119 is merged.
Testing
- `npm test`: CLI tests pass

Refs: #118, PR #1092
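To make the session ID cases concrete, here is a shell analogue of the kind of sanitization the tests target. It is purely illustrative: the real `sanitizeSessionId` is a JavaScript function whose rules are not shown in this PR, so the allowed character set, the stripping of leading hyphens, and the rejection of empty results are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical shell analogue; the real sanitizeSessionId may behave differently.

# Print a sanitized session ID, or return 1 if nothing usable remains.
sanitize_session_id() {
  local cleaned
  # Assumption: keep only letters, digits, underscore, and hyphen.
  cleaned="$(printf '%s' "$1" | tr -cd 'A-Za-z0-9_-')"
  # Strip leading hyphens so the value cannot be parsed as a CLI option
  # (negative Telegram chat IDs start with "-").
  while [ "${cleaned#-}" != "$cleaned" ]; do
    cleaned="${cleaned#-}"
  done
  # Empty or special-character-only input leaves nothing; reject it.
  if [ -z "$cleaned" ]; then
    return 1
  fi
  printf '%s\n' "$cleaned"
}

sanitize_session_id "-1001234567"
sanitize_session_id "--help"
sanitize_session_id '!!!' || echo "rejected"
```

The three calls cover a negative chat ID, an option-shaped value, and a special-character-only value, which correspond to the scenarios named for T14-T17.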